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Abstract —The problem of quickest detection of a change in the 
distribution of a n x p random matrix based on a sequence of 
observations having a single unknown change point is considered. 
The forms of the pre- and post-change distributions of the rows 
of the matrices are assumed to belong to the family of elliptically 
contoured densities with sparse dispersion matrices but are 
otherwise unknown. We propose a non-parametric stopping rule 
that is based on a novel summary statistic related to k-nearest 
neighbor correlation between columns of each observed random 
matrix. In the large scale regime of p —> oo and n fixed we show 
that, among all functions of the proposed summary statistic, 
the proposed stopping rule is asymptotically optimal under a 
minimax quickest change detection (QCD) model. 

I. Introduction 

In this paper we consider the problem of sequential detec¬ 
tion of a change in the distribution of a sequence of large 
scale random matrices. The random matrices have i.i.d. rows 
where the pre-change and post-change distributions of the 
rows are known to belong to the elliptically contoured family 
but are otherwise unknown. This large scale non-parametric 
sequential detection problem has applications in multivariate 
time-series analysis, stochastic finance, social networks and 
failure detection, among others. In multivariate time-series 
analysis, it is of interest to know if the coefficients of the 
time series has changed over time. In stochastic finance, it 
is of interest to detect a sudden change in the correlation 
between a set of stocks being monitored. In social networks, 
it is of interest to detect an abrupt change in the interaction 
level between a pair of agents. In failure detection, often the 
dynamics of a mechanical structure can be characterized by 
multi-variate data, and a change in the dynamics should be 
detected as quickly as possible. 

In such cases the observations can be described as a 
sequence of random matrices. The rows of these random matri¬ 
ces may correspond to approximately independent realizations 
of p different variables, e.g., sampled over blocks of time or 
sampled in a sequence of repeated experiments. For example, 
in the case of detecting a change in the coefficients of a 
Gaussian univariate time series, p successive time samples may 
be acquired over n well separated blocks of time. A change 
in the coefficients of the time series is reflected in a change in 
the correlation matrix associated with each block. In stochastic 
finance, we may have access to multiple instances of stock 


values over a day or week, and a change in correlation may 
occur only at the end of the day or week. 

In this paper we consider the problem of quickest detection 
of a change in population dispersion (or correlation) matrix 
under the assumption of elliptically contoured distribution of 
the rows of the sequence of nxp random matrices. The results 
in this paper hold for the big data regime of p n for which 
p —> oo and n is fixed and small. The precise mathematical 
problem is stated in Section [II] 

If a parametric model for the data is known before and after 
change, then various efficient procedures from the quickest 
change detection literature (see, e.g., (T), |2}, and {3|) can be 
used for detection. However, in the absence of a parametric 
model, a situation common in Big Data settings, no optimal 
procedures are known. In this paper we propose a technique 
for quickest change detection in this setting. 

Specifically, we propose a novel summary statistic for 
the data matrix: the minimal /.'-nearest neighborhood of the 
columns of the random matrix under a correlation magnitude 
distance. We obtain an approximate distribution for the sum¬ 
mary statistic in the big data regime. We show that the dis¬ 
tribution of the summary statistic belongs to a one-parameter 
exponential family, with the unknown parameter a function 
of the underlying distribution of the data matrix. We then 
treat the sequence of summary statistics as our observation 
sequence, and apply Lorden’s test [4j. This work is motivated 
by the theory of correlation screening and correlation mining 
[5], and specifically the theory of hub discovery in large scale 
correlation graphs from |6|. 

II. Problem Description 

A decision-maker sequentially acquires samples from a fam¬ 
ily of distributions of nxp random matrices over time, indexed 
by to, leading to the random matrix sequence {X(TO)} m >i, 
called data matrices. For each to the random matrix X(to) has 
the following properties. Each of its n rows is an independent 
identically distributed (i.i.d.) sample of a p-variate random 
vector X(m) = [Xi (to), • • • , X p (m)} T with p x 1 mean /r m 
and pxp positive definite dispersion matrix £ m . The random 
vector X(m) has an elliptically contoured density, also called 
an elliptical density f7j, 

/X(m)(x) = ffm((x - Mm) TS m 1 ( x ~ H m )), 


for some nonnegative strictly decreasing function g m on ]R + . 
If n m = 0 and S m = I p , where I p is the pxp identity matrix, 
then the random vector X(m) is said to have a spherical 
density. 

The samples {X(m)} are assumed to be statistically inde¬ 
pendent. For some time parameter 7 the samples are assumed 
to have common dispersion parameter S 0 and function g 0 
for to < 7 and common dispersion parameter Si / Eo 
and function g\ for m > 7. 7 is called the change point 
and the pre-change and post-change distributions of X(to) are 
denoted and /^, respectively. No assumptions are made 
about the mean parameter fj, m , and can take different values 
for different to. More specifically, as the rows of X(m) are 
i.i.d. realizations of the elliptically distributed random variable 
X(m), this change-point model is described by: 

X M - /xW = So((x - pi m ) T E 0 -1 (x - n m )), to < 7 
~ /x(x) =5i((x- pt m ) T Si" 1 (x-pt m )), TO > 7. 

( 1 ) 

At each time point to the decision-maker decides to either stop 
sampling, declaring that the change has occurred, i.e., to > 7, 
or to continue sampling. The decision to stop at time to is only 
a function of (X(l), • • • ,X(m)). Thus, the time at which the 
decision-maker decides to stop sampling is a stopping time for 
the matrix sequence {X(to)}. The decision-maker’s objective 
is to detect this change in distribution of the data matrices as 
quickly as possible, subject to a constraint on the false alarm 
rate. 

The above detection problem is an example of the quickest 
change detection (QCD) problem. See (2), (T), and (3} for 
an overview of the QCD literature. In the QCD problem the 
objective is to find a stopping time r on the sequence of data 
matrices {X(m)}, so as to minimize a suitable metric on the 
delay (r — 7), subject to a constraint on a suitable metric on 
the event of false alarm {r < 7 }. This paper follows the QCD 
formulation of Poliak ( 8 ): 

min sup E 7 [t — 7 |t > 7 ] 
r 7>1 (2) 

subj. to E 0 c[r]>/3, 

where E 7 is the expectation with respect to the probability 
measure under which the change occurs at 7, E^ is the 
corresponding expectation when the change never occurs, and 
j3 > 1 is a user-specified constraint on the mean time to false 
alarm. 

If the pre- and post-change densities /■£■ and are known 
to the decision maker, and fi m is constant before and after 
change, then algorithms like the Cumulative Sum (CuSum) 
algorithm (9j, |4) , |T0) , or the Shiryaev-Roberts (SR) family 
of algorithms [[TTj, ( 8 J, & can be used for efficient change 
detection. Both the CuSum algorithm and the SR family of 
algorithms have strong optimality properties with respect to 
both the popular formulations of Lorden ||4) and that of Poliak 
Q, used in this paper. 

If only the pre-change and post-change functions go and 
gi are known then ([2ji i s a parametric QCD problem. In this 


case, under the assumption that pL m = // (l , to < 7, and 2 0 
are known, efficient QCD algorithms can be designed, having 
strong asymptotic optimality properties, based on, e.g., the 
generalized likelihood ratio (GLR) technique [3|, the mixture 
based technique |3), or the nonanticipating estimation based 

technique 03 - 

In many situations, however, even the pre- and post-change 
functions go and g\ may be unknown. This is the non- 
parametric QCD setting considered in this paper. While one 
can use non-parametric QCD tests based on signs and ranks 
03 ’ or based on empirical distribution estimates fl5] , there 
are no known optimal solutions to 0 in the non-parametric 
setting. 

In this paper we provide an asymptotically optimal solution 
to the minimax QCD problem |2| in the random matrix setting 
0 using recently developed large scale random matrix theory 
| 6 |. The solution is optimal in the following sense. The theory 
from @ establishes that a certain summary statistic, denoted 
by V(X), derived from an n x p random matrix X has a 
limiting distribution as p —» 00 for fixed n, the so-called 
’’purely high dimensional regime” m- This summary statistic 
is related to the empirical distribution of the vertex degree of 
the correlation graph associated with the thresholded sample 
correlation matrix. Below we show that the distribution of the 
statistic V(X) converges to a parametric distribution in the 
exponential family in this purely high dimensional regime. We 
then apply the GLR based Lorden’s test |4| to the sequence of 
summary statistics {V (X(to))} to detect the change efficiently. 
Thus, the proposed stopping rule is asymptotically optimal 
under the Lorden minimax quickest change detection (QCD) 
model j4j, and hence also in terms of solving <[ 2 ]), among 
all rules that are stopping rules for the proposed summary 
statistics sequence. 

III. Summary Statistic for the Data Matrix 

In this section we define a summary statistic V(X) and 
then use the results from | 6 j to obtain its asymptotic density 
in the purely high dimensional regime of p —> 00, n fixed. 
This asymptotic distribution is a member of a one-parameter 
exponential family. 

For an elliptically distributed random data matrix X we write 

X = [Xr, • • • ,X p ] = [Xf 1)> ... ,Xf ra) ] T , 

where X, = 1 X n .j\ T is the i th column and X(a = 

[Xu , • • • , Xi p ] is the i th row. Define the sample covariance 
matrix as 

1 n 

S =^E( X F)- X ) r ( X «- X ). 

z — 1 

where X is the sample mean of the n rows of X. Also define 
the sample correlation matrix as 

R = Ds _ 1 / 2 SD s ~ 1/2 , 

where Da denotes the matrix obtained by zeroing out all but 
the diagonal elements of the matrix A. Note that, under our 



assumption that the dispersion matrix X of the rows of X is 
positive definite, Ds is invertible with probability one. Thus 
Rij, the element in the i th row and the j th column of the 
matrix R, is the sample correlation coefficient between the 
i th and j th columns of X. 

Define 4 n (i) to be the sample correlation between the ?'-th 
column of X and its fc-th nearest neighbor in the columns of 
X (in terms of Euclidean distance): 

4 nW := k th largest order statistic of {|R.y \;j y (}. 
Then for fixed k , define the summary statistic 

Vfc(X) := maxd^|(i). (3) 

i 

Below we show that the distribution of the statistic 14 can be 
related to the distribution of an integer valued random variable 
Ng tP which we define below. 

For a threshold parameter p £ [0,1] define the correlation 
graph fy p (R) associated with the correlation matrix R as an 
undirected graph with p vertices, each representing a column 
of the data matrix X. An edge is present between vertices i 
and j if the magnitude of the sample correlation coefficient 
between the i th and j th components of the random vector X 
is greater than p, i.e., if R, y | > p, i ^ j. We define 8, to 
be the degree of vertex i in the graph t/ p (R). For a positive 
integer 8 < p — 1 we say that a vertex i in the graph U P (R) 
is a hub of degree 8 if 4: 4 <5- We denote by Ng tP the total 
number of hubs in the correlation graph f/ p (R), i.e., 

Ng tP = cardji : Si > 8}. 

The events |I4(X) > p] and {Ng iP > 0} are equivalent. 
Hence 

P(V S (X) >p) = P (N s , p > 0). (4) 

An asymptotic approximation to the probability P(Ng ,p > 
0) is obtained in | 6 j by relating Ng iP to a Poisson random 
variable in the purely high dimensional limit as p —> oo and n 
fixed. We summarize the approximation in the theorem below. 
We say that a matrix is row sparse of degree k if there are no 
more than k nonzero entries in any row. We say that a matrix is 
block sparse of degree k if the matrix can be reduced to block 
diagonal form having a single k x k block, via row-column 
permutations. 

Theorem 3.1 ( 0/): Let X be row sparse of degree k = o{p). 
Also let p —>■ oo and p = p p —> 1 such that p 1 ^ 5 (p — 1)(1 — 
p 2 )(n- 2)/2 _j. €n S e ( 0 , oo). 

1 ) 

P {Ng'f, > 0) 1 - exp(—AJx/</>(4)), 

where 

A = lim A(p) = {(e n ja n )/(n - 2)) 5 /<5!, 

p—>oo,p—>-1 

with 

A(p)=p( P " 1 )Po(p) 5 , 


Po(p) =a n f (1 - u 2 Y 
J p 


du , 


a n = 2B((n—2)/2, 1/2) with B(l,m) the beta function, 

(j>{5) = 2 if 8 = 1, cj)(8) = 1 otherwise, and Jx is a 
positive real number that is a function of the joint density 
of X. 

2) If the dispersion matrix S of the p-variate vector X is 
block sparse of degree k, then 

J x = 1 + 0{{k/p) s+1 ). 

In particular, if the dispersion matrix X is diagonal then 

Jx = 1 . 

Using 0 and Theorem the large p distribution of Un¬ 
defined in ([3]) can be approximated, for k = 8. by 

P(I4(X) < p) = exp(—A(p) Jx//>(<5)), P e [0,1], (5) 


where A(p) is as defined in Theorem 3.1 Although the 
theorem is valid for large values of p, numerical experiments 
|6| have shown that the approximation remains accurate for 
smaller values of p as long as n is small and p n. 

The distribution 0 is differentiable everywhere except at 
p = 0 since P(Vg(X) = 0) > 0 when using the finite p and 
p < 1 approximation A p for A specified in Theorem 3.1 For 
p > 0 and large p, Vg has density 


A '(P) 


m 


A (P) 


fv{p) = --7777 exp [-—7jr Jx , P G (0,1]. (6) 


m 


Note that fv in 0 is the density of the Febesgue continuous 
component of the distribution 0 and that it integrates to 1 — 
0(e~ p2 ) over p £ (0,1]. 

The density fy is a member of a one-parameter exponential 
family with Jx as the unknown parameter. This follows from 
the relations below. First 

8 


A (p) = p{^ ^ ^ ^a n J (1 - u 2 ) 2 du 

= CT(p ) s , 


where 


C = C P 

does not depend on p, and 


T{p) = f 

Jn 


P -1 
5 


(1 — u 2 )^^ du. 


(7) 


( 8 ) 


(9) 


Using 0 and noting that T{p)' = —(1 — , we have for 

p £ [0,1], the exponential family form of the density fy with 
parameter Jx: 

fv(p\ Jx) 

CS 


= W) T ^ )S Jxexp ( _ 


CT(p? 

m 


Jx 


( 10 ) 


The constant 8 in © is a fixed design parameter that 
can be selected to maximize change detection performance 








according to Q. In the sequel, we fix 5 = 1. For this value of 
<5, the statistic Vs reduces to the nearest neighbor (correlation) 
distance 


V(X) = max|R,j| 


and the density in ( [T()| reduces to 

C 4 

fv(p; J) = -y(l - J exp 




(ID 


pe (o,i], 
( 12 ) 
exponential 


where we have suppressed subscript X in the 
family parameter J on the distribution of X. 

In Fig. [T] is plotted the density fy for various values of J 
for n = 10, and p = 100. We note that for the chosen values 
of n and p, the density is concentrated close to 1 , consistent 
with large values of p arising in the purely high dimensional 


regime assumed in Theorem 3.1 



p 


Fig. 1. Plot of density fy in © for various values of the parameter J for 
n = 10, p = 100. This is the density of the summary statistic used to detect 
the change in covariance of the random matrix sequence X. 


IV. QCD FOR LARGE SCALE RANDOM MATRICES 

Here we apply the asymptotic results derived in Section III 
to quickest change detection of the distribution of the sum¬ 
mary statistic V. Assume that both the pre- and post-change 
dispersion matrices. So and Si, are row sparse with degree 
k = o(p), and map the data matrix sequence to the sequence of 
summary statistics {V 5 (X(?n))} m >i. For simplicity we refer 
to this sequence by {V(m)}. Let J 0 and Ji be the value of 
parameter J before and after change, respectively. The QCD 
problem on the density /x, depicted in <|TJ. is reduced to the 
QCD problem on the density fy: 

V(m) ~ /y(-; J 0 ), to < 7 
~ m > 7- 

We recall from Theorem o that if the dispersion matrix 
So is diagonal then Jq = 1. Thus, if the pre-change dispersion 
matrix is diagonal, then the QCD problem reduces to the para¬ 
metric QCD problem with unknown post-change parameter J: 


(13) 


V(m) 


/y(-; 1), 


TO < 7 

J ^ 1, TO > 7 . 


(14) 


If the dispersion matrix S 0 is only block sparse with degree 
k -C p, by assertion 2 of Theorem 3.1 we can use the 
approximation Jq « 1 . 


Consider the following QCD test, defined by the stopping 
time r G : 


t g = inf 

m> 1 


{ m 

max sup > log 


Mm i) 



(15) 


where A and e > 0 are user-defined parameters. The parameter 
A is a threshold used to control the false alarm rate. The 
parameter e represents the minimum magnitude of change, 
away from J = 1, that the user wishes to detect. 

The stopping rule r G was shown to be asymptotically 
optimal in j4j| for a related QCD problem when 1) the marginal 
density fy{v,-) of the observation sequence {V(m)} is of 
known form that is a member of a one-parameter exponential 
family and 2 ) when the parameter Jq of the pre-change 
density is known. Both of these properties are satisfied for 
the summary statistic V = V (X) for the QCD model in ( [T4l > 
defined above, since J 0 = 1. Due to the results in 0 ’ the 
stopping rule r G is asymptotically optimal for the problem in 
0 as well. 

The following theorem establishes strong asymptotic opti¬ 
mality of this test. 

Theorem 4.1 ( & Fix any e > 0. 

1) For the stopping rule r G , the supremum in ([2]i is achieved 
at 7 = 1 , i.e., 

sup E 7 [t g - 7|t g > 7] = Ei[t g - 1 ], 

7>1 

2) Setting A = log [3 ensures that as j3 —> 00, 

Eoo[t g ] > /3(l + o(l)), 

and for each possible true post-change parameter J, with 

\J-l\>e, 

Ei[Tg] = ^ (1 + ° (1)) 

= a su P E 7l r -7k >7](l + o(l)), 

(16) 

where I{J) is the Kullback-Leibler divergence between 
the densities fy(-',J ) and fy(-; 1 ). 


Theorem 4.1 implies that the stopping rule r G is uniformly 
asymptotically optimal for each post-change parameter J, as 
long as | J — 1| > e. For convenience of implementation one 
can also use the window limited variation of r G as suggested 
in 0 . 

V. Numerical Results 


Here we apply the stopping rule r G in ( fl5] > to the problem 
of detecting a change in the distribution when the (X(to)} 
are Gaussian distributed random matrices. In this case the 
dispersion S is the covariance matrix of the rows of X. 
The pre-change covariance is the p x p diagonal matrix 
S 0 = diag(of), where af > 0 are arbitrary component-wise 
variances. The post-change covariance matrix Si is obtained 
by replacing the k x k top left block of the identify matrix 















Ip by a sample from the Wishart distribution. We set n = 10, 
p = 100, and k = 5. 

To implement r G we have chosen e = 1.5, and we use the 
the maximum likelihood estimator which, as a function of m 
samples (V(l),--- ,V(m)) from /y(-, J), is given by 

; ( ni),-,r W )- c, E „; r(n<)) . <n> 

Specifically, 


arg max 

J:J> 2.5 



Mm 1) 


( 18 ) 


= max{2.5, J(V(£), ■ ■ ■ 

In Fig. [2] we plot the delay (Ei[r]) vs the log of mean time to 
false alarm (logEoo^]) for various values of the post-change 
parameter J. The values in the figure are obtained by choosing 
different values of the threshold A and estimating the delay 
by choosing the change point 7 = 1 and simulating the test 
for 500 sample paths. The mean time to false alarm values 
are estimated by simulating the test for 1500 sample paths. 
The parameter J for the post-change distribution is estimated 
using the maximum likelihood estimator (fTTj). As predicted by 



log(E co [T]) ->- 


Fig. 2. The empirical mean time to detect vs mean time to false alarm (in 
log scale). The mean time to detect decreases as the parameter J increases. 

the theory, the delay vs log of false alarm trade-off curve is 
approximately linear. For larger values of J, the Kullback- 
Leibler (K-L) divergence between ) and /y(-,l) is 

larger, resulting in smaller delays. For the chosen values of 
the post-change parameters J = 1.73, 2.9, 9.45 and 16.54, 
the corresponding K-L divergence values /(J) are 0.127, 0.41, 
1.35 and 1.86, respectively. 

In Fig. [3] we compare the delay vs false alarm trade-off 
curve for the post-change parameter J = 2.9 plotted in Fig. [2] 
with the values predicted by the theory: ■ We see from 

Fig. [3]that the predictions are quite accurate. We have obtained 
similar results when the test was simulated for different block 
sizes k. Thus, the change can be efficiently detected using our 
proposed methodology. 

VI. Conclusions and Future Work 

We have introduced a novel summary statistic based on 
correlation mining and hub discovery for performing non- 
parametric quickest change detection (QCD) on a sequence 



Fig. 3. Comparison of the delay vs false alarm trade-off curve for J = 2.9 
from Figplwith the values predicted by the theory: = los 0 E ” ^ . 


of large scale random matrices. The proposed QCD algorithm 
is strongly optimal in the sense of Lorden |4J and Poliak 
among all detection algorithms that use our summary statistic. 
Future work will include extensions to local summary statistics 
and experiments with QCD in real applications that yield 
sequences of large scale random matrix measurements. 
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